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Abstract 


Path integral (PI) control problems are a restricted class of non-linear control 
problems that can be solved formally as a Feyman-Kac path integral and can be 
estimated using Monte Carlo sampling. In this contribution we review path integral 
control theory in the finite horizon case. 

We subsequently focus on the problem how to compute and represent con¬ 
trol solutions. Within the PI theory, the question of how to compute becomes the 
question of importance sampling. Efficient importance samplers are state feedback 
controllers and the use of these requires an efficient representation. Learning and 
representing effective state-feedback controllers for non-linear stochaslic conlrol 
problems is a very challenging, and largely unsolved, problem. We show how to 
learn and represent such controllers using ideas from the cross entropy method. 
We derive a gradient descent method that allows to learn feed-back controllers us¬ 
ing an arbilrary parametrisation. We refer to this method as the Path Integral Cross 
Entropy method or PICE. We illustrate this method for some simple examples. 

The path integral control methods can be used to estimate the posterior distri¬ 
bution in latent state models. In neuroscience these problems arise when estimat¬ 
ing connectivity from neural recording data using EM. We demonstrate the path 
integral conlrol method as an accurate alternative to particle filtering. 
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1 Introduction 


Stochastic optimal control theory (SOC) considers the problem to compute an optimal 
sequence of actions to attain a future goal. The optimal control is usually computed 
from the Bellman equation, which is a partial differential equation. Solving the equa¬ 
tion for high dimensional systems is difficult in general, except for special cases, most 
notably the case of linear dynamics and quadratic control cost or the noiseless deter¬ 
ministic case. Therefore, despite its elegance and generality, SOC has not been used 
much in practice. 

In I Fleming and Mitter, 1982| it was observed that posterior inference in a certain 
class of diffusion processes can be mapped onto a stochastic optimal control prob¬ 
lem. These so-called Path integral (PI) control problems | Kappen, 2005| represent a 
restricted class of non-linear control problems with arbitrary dynamics and state cost, 
but with a linear dependence of the control on the dynamics and quadratic control cost. 
For this class of control problems, the Bellman equation can be transformed into a 
linear partial differential equation. The solution for both the optimal control and the 
optimal cost-to-go can be expressed in closed form as a Feyman-Kac path integral. 
The path integral involves an expectation value with respect to a dynamical system. 
As a result, the optimal control can be estimated using Monte Carlo sampling. See 
flTodorov, 20091 [Kappen, 201 1| [Kappen et al., 2012) for earlier reviews and references. 

In this contribution we review path integral control theory in the finite horizon case. 
Important questions are; how to compute and represent the optimal control solution. In 
order to efficiently compute, or approximate, the optimal control solution we discuss 
the notion of importance sampling and the relation to the Girsanov change of measure 
theory. As a result, the path integrals can be estimated using (suboptimal) controls. 
Different importance samplers all yield the same asymptotic result, but differ in their 
efficiency. We show an intimate relation between optimal importance sampling and 
optimal control: we prove a Lemma that shows that the optimal control solution is 
the optimal sampler, and better samplers (in terms of effective sample size) are bet¬ 
ter controllers (in terms of control cost) [ Thijssen and Kappen, 2015) . This allows us 
to iteratively improve the importance sampling, thus increasing the efficiency of the 
sampling. 

In addition to the computational problem, another key problem is the fact that the 
optimal control solution is in general a state- and time-dependent function u{x, t) with u 
the control, x the state and t the time. The state dependence is referred to as a feed-back 
controller, which means that the execution of the control at time t requires knowledge 
of the current state x of the system. It is often impossible to compute the optimal 
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control for all states because this function is an infinite dimensional object, which we 
call the representation problem. Within the robotics and control community, there are 
several approaches to deal with this problem. 


Deterministic control and local linearisation 


The simplest approach follows from the realisation that state-dependent control is only 
required due to the noise in the problem. In the deterministic case, one can compute 
the optimal control solution u{t) - u*{x*{t),t) along the optimal path x*{t) only, and 
this is a function that only depends on time. This is a so-called open loop controller 
which applies the control u(t) regardless of the actual state that the system is at time t. 
This approach works for certain robotics tasks such a grasping or reaching. See for in¬ 
stance HTheodorou et al., 2010| [Schaal and Atkeson, 2010| who constructed open loop 
controllers for a number of robotics tasks within the path integral control framework. 
However, open loop controllers are clearly sub-optimal in general and simply fail for 
unstable systems that require state feedback. 

It should be mentioned that the open loop approach can be stabilised by computing 
a linear feed-back controller around the deterministic trajectory. This approach uses the 
fact that for linear dynamical systems with Gaussian noise and with quadratic control 
cost, the solution can be efficiently computed. Q One defines a linear quadratic control 
problem around the deterministic optimal trajectory x*{t) by Taylor expansion to sec¬ 
ond order, which can be solved efficiently. The result is a linear feedback controller 
that stabilises the trajectory x*{t). This two-step approach is well-known and powerful 
and at the basis of many control solutions such as the control of ballistic missiles or 
chemical plants [ [Stengel, 1993) . 

The solution of the linear quadratic control problem also provides a correction to 
the optimal trajectory x*{f). Thus, a new x*(t) is obtained and a new LGQ problem 
can be defined and solved. This approach can be iterated, incrementally improving the 
trajectory x*{t) and the linear feedback controller. This approach is known as Differ¬ 
ential Dynamic Programming | Mayne, 1966| [Murray and Yakowitz, 1984] or the Iter¬ 
ative LQG method OTodorov and Li, 2005| . In the robotics community this is a popular 
method, providing a practical compromise between stability, non-linearity and efficient 
computation [[Morimoto et al., 2003[[Tassa, 201 1[ [Tassa et ah, 2014| . 


Model predictive control 

The second approach is to compute the control ’at run-time’ for any state that is visited 
using the idea of model predictive control (MPC) [ [Camacho and Alba, 2013| . At each 
time t in state x,, one defines a finite horizon control problem on the interval [f, t + 7} 
and computes the optimal control solution u{s, x^), t < s < t + T on the entire interval. 
One executes the dynamics using u(t, x,) and the system moves to a new state Xt+jt as a 
result of this control and possible external disturbances. This approach is repeated for 
each time. The method relies on a model of the plant and external disturbances, and 
on the possibility to compute the control solution sufficiently fast. MPC yields a state 
dependent controller because the control solution in the future time interval depends 
on the current state. MPC avoids the representation problem altogether, because the 

*For these so-called linear quadric control problems (LQG) the optimal cost-to-go is quadratic in the state 
and the optimal control is linear in the state, both with time dependent coefficients. The Bellman equation 
reduces to a system of non-linear ordinary differential equations for these coefficients, known as the Ricatti 
equation. 
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control is never explicitly represented for all states, but computed for any state when 
needed. MPC is particularly well-suited for the path integral control problems, because 
in this case the optimal control u*(x, t) is explicitly given in terms of a path integral. 
The challenge then is to evaluate this path integral sufficiently accurate in real time. 
[ Thijssen and Kappen, 2015) propose adaptive Monte Carlo sampling that is acceler¬ 
ated using importance sampling. This approach has been successfully applied to the 
control of 10 to 20 autonomous helicopters (quadrotors) that are engaged in coordi¬ 
nated control tasks such as flying with minimal velocity in a restricted area without 
collision or a task where multiple ’cats’ need to catch a mouse that tries to get away 
UGomez et al., 20151 . 


Parametrized solution 


The third approach is to consider a parametrised family of controllers u{t, x||0) and to 
find the optimal parameters 0*. If successful, this yields a near optimal state feedback 
controller for all f, x. This approach is well-known in the control and reinforcement 
community. Reinforcement learning (RL) is a particular setting of control problems 
with the emphasis on learning a controller on the basis of trial-and-error. A sequence 
of states X,, t - 0, dt, 2dt ,..., is generated from a single roll-out of the dynamical 
system using a particular control, which is called the policy in RL. The ’learning’ 
in reinforcement learning refers to the estimation of the optimal policy or cost-to- 
go function from a single roll out ISutton and Barto, 1998|| . The use of function ap¬ 
proximation in RL is not straightforward | Bellman and Dreyfus, 1959 [Sutton, 19881 
[Bertsekas and Tsitsiklis, 19^61 . To illustrate the problem, consider the infinite hori¬ 
zon discounted reward case, which is the most popular RL setting. The problem is 
to compute the optimal cost-to-go of a particular parametrised form; J{x\ff). In the 
non-parametrised case, the solution is given by the Bellman ’back-up’ equation, which 
relates 7(x,) to J{xt+dt) where Xt^+di are the states of the system at time tj + dt, respec¬ 
tively and Xt+dt is related to x, through the dynamics of the system. In the parametrised 
case, one must compute the new parameters 6' of J(xt\0') from J(xt+dt\ff) ■ The problem 
is that the update is in general not of the parametrised form and an additional approxi¬ 
mation is required to find the ff that gives the best approximation. In the RL literature, 
one makes the distinction between ’on-policy’ learning where J is only updated for the 
sequence of states that are visited, and off-policy learning updates J{x) for all states 
X, or a (weighted) set of states. Convergence of RL with function approximation has 
been shown for on-policy learning with linear function approximation (ie. 7 is a linear 


function of 6) [Tsitsiklis and Van Roy, 19971. These authors also provide examples of 
both off-policy learning and non-linear function approximation where learning does 
not converge. 


Outline 

This chapter is organized as follows. In section[2]we present a review of the main ingre¬ 
dients of the path integral control method. 'We define the path integral control problem 
and state the basic Theorem of its solution in terms of a path integral. We then prove 
the Theorem by showing in section [2T[ that the Bellman equation can be linearized by 
a log transform and in section [Z2[ that the solution of this equation is given in terms 
of a Feyman-Kac path integral. In section [231 we discuss how to efficiently estimate 
the path integral using the idea of importance sampling. We show that the optimal 
importance sampler coincides with the optimal control. In section [3] we review the 
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cross entropy method, as an adaptive procedure to compute an optimized importance 
sampler in a parametrized family of distributions. In order to apply the cross entropy 
method, we reformulate the path integral control problem in terms of a KL divergence 
minimization in section[3T|and in section [372l we apply this procedure to the obtain op¬ 
timal samplers/controllers to estimate the path integrals. In section|4]we illustrate the 
method to learn a parametrized time independent state dependent controller for some 
simple control tasks. 

In section|5]we consider the reverse connection between control and sampling: We 
consider the problem to compute the posterior distribution of a latent state model that 
we wish to approximate using Monte Carlo sampling, and to use optimal controls to 
accelerate this sampling problem. In neuroscience, such problems arise, e.g. to esti¬ 
mate network connectivity from data or decoding of neural recordings. The common 
approach is to formulate a maximum likelihood problem that is optimized using the 
EM method. The E-step is a Bayesian inference problem over hidden states and is 
shown to be equivalent to a path integral control problem. We illustrate this for a small 
toy neural network where we estimate the neural activity from noisy observations. 

2 Path integral control 

Consider the dynamical system 

dX{s) = f(s,X{s))ds + g(s,X(s))(u(s,X(s))ds + dWisfj t<s<T (1) 

with X(t) — X. dW{s) is Gaussian noise with E dW{s) = 0, E dW{s)dW(r) - ds6(s — r). 
The stochastic process IT(s), t < s < T is called a Brownian motion. We will use upper 
case for stochastic variables and lower case for deterministic variables, t denotes the 
current time and T the future horizon time. 

Given a function u{s,x) that defines the control for each state x and each time 
t < s <T, define the cost 

-I- J u{s,X{s))dW{s) (2) 

with f, X the current time and state and u the control function. The stochastic optimal 
control problem is to find the optimal control function u: 

J(t, x) - min E S (t, x, u) 

U 

u*it, x) - arg min E S (f, x, u) (3) 

U 

where E is an expectation value with respect to the stochastic process Eq. [T]with initial 
condition Xt - x and control u. 

J{t, x) is called the optimal cost-to-go as it specifies the optimal cost from any inter¬ 
mediate state and any intermediate time until the end time t - T. Eor any control prob¬ 
lem, J satisfies a partial differential equation known as the Hamilton-Jacobi-Bellman 
equation (HJB). In the special case of the path integral control problems the solution is 
given explicitly as follows. 


S(t,x, u) = <1>(X(7’))- 


r 
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Theorem 1. The solution of the control problem Eqs. |5]w given by 


(4) 

(5) 


J{t, x) 
u*(t, x) 


where we define 


- log iA(f, x) il/(t, x)^E e-^ 


u(t, x) 




ldW\ 1 E 

and lT(s), s > t the Brownian motion. 

The path integral control problem and Theorem [1] can be generalised to the multi¬ 
dimensional case where X{t), f{s,X{s)) are n-dimensional vectors, u{s,X{s)) is an m 
dimensional vector and g{s, X{s)) is an n x m matrix. dW(s) is m-dimensional Gaussian 
noise with E dW{s) = 0 and E dW{s)dW{r) = vds6(s - r) and v the m x m positive 
definite covariance matrix. Eqs.[T]and|2become: 


dX(s) = f{s,X{s))ds + g{s,X{s))(u{s,X{s))ds + dW{s)j 


t < s < T 


S(t,x,u) = - I(h(X(7’)) H- 


r 


V(s,X(s)) + -u(s,X(s)yRu(s,X(s))\ds 


r 


u{s,X(s)yRdW(s) 


(7) 


where ' denotes transpose. In this case, v and R must be related as with AI - Rv with 
d > 0 a scalar | Kappen, 2005) . 

In order to understand this result, we first will derive in section im the HJB equation 
and show that for the path integral control problem it can be transformed into a linear 
partial differential equation. Subsequently, in section lZ^ we present a Lemma that will 
allow us prove the Theorem. 


2.1 The linear HJB equation 

The derivation of the HJB equation relies on the argument of dynamic programming. 
This is quite general, but here we restrict ourselves to the path integral case. Dynamic 
programming expresses the control problem on the time interval [t, T] as an instanta¬ 
neous contribution at the small time interval [f, t + ds] and a control problem on the 
interval [f -i- ds, T], From the definition of J we obtain that J{T, x) - <t>(v), Vx. 

We derive the HJB equation by discretising time with infinitesimal time increments 
ds. The dynamics and cost-to-go become 

Xs+d.'! = JCi + fs{Xs)ds + gs(Xs)(^UsiXs)ds + dWs) s - t,t + ds,... ,T - ds 

T-ds / j \ r -ds 

S,(x,U,:T...ds) = <b(xr) + V H- -H V Us{Xf)dWs 

S=t ' ' S-t 

The minimisation in Eq.j^is with respect to a functions u of state and time and becomes 
a minimisation over a sequence of state-dependent functions Ut-r-ds - s - 
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t,t + ds,... ,t + T — c/i}: 

J,{xi) = min E S ,(x,, u,,T^ds) 



- min Vt{x,)ds + -u,(x,fds + min E S,+ds(Xi+ds, Ut+ds-.T-ds) 


min ( Vt(x,)ds + —ii,(x,)^ds + Jt(x,) + ds(f,(x,) + g,(x,)u,(xt))dxJt(xt) 

III \ 2 


The first step is the dehnition of J,. The second step separates the cost term at time t 
from the rest of the contributions in S t, uses that EJW? = 0. The third step identihes the 
second term as the optimal cost-to-go from time t + ds in state Xt+ds- The expectation 
is with respect to the next future state Xt+ds only. The fourth step uses the dynamics of 
X to express X,+ds in terms of Xt, a hrst order Tayler expansion in ds and a second order 
Taylor expansion in {Xt+ds-Xi and uses the fact that EXt+ds-Xt - {fi{xt)+gt{xt)ut{xi))ds 
and E(X,+rfs - Xt)^ = EdWf + 0{ds^) = ds + 0{ds^). dt,x are partial derivatives with 
respect to t, x respectively. 

Note, that the minimization of control paths u,-T~ds is absent in the hnal result, and 
only a minimization over u, remains. We obtain in the limit ds —> 0: 



( 8 ) 


Eq. |8]is a partial differential equation, known as the Hamilton-Jacobi-Bellman (HJB) 
equation, that describes the evolution of 7 as a function of x and t and must be solved 
with boundary condition J{x, T) - ip{x). 

Since u appears linear and quadratic in Eq. 0 we can solve the minimization with 
respect to u which gives u*(t,x) = -g{t,x)dxJ{t,x). Dehne il/{t,x) = then the 

HJB equation becomes linear in ij/: 



(9) 


with boundary condition iJ/(T, x) - e 

2.2 Proof of the Theorem 

In this section we show that Eq. 0 has a solution in terms of a path integral (see 
[ Thijssen and Kappen, 2015) ). In order to prove this, we hrst derive the following 
Lemma. The derivation makes use of the so-called Ito calculus which we have sum¬ 
marised in the appendix. 

Lemma 2. Define the stochastic processes Y(s),Z(s),t < s < T as functions of the 
stochastic process Eq. |7} 
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When if/ is a solution of the linear Bellman equation Eq.^and u* is the optimal control, 
then 


= f^Z(s)t//(s,X,Xu*(s,X,)-u(s,X,))clW(s) (11) 

Proof. Consider il/{s,X{s)),t < i < T as a function of the stochastic process Eq. [T] 
Since X(s) evolves according to Eq. [l] is also a stochastic process and we can use 
Ito’s Lemma (Eq. |33]to derive a dynamics for if. 

dif - ^dtif + (/ + gu)dxif + - g^d^ij^ ds + gdWdxf - Vifds + g(uds + dW)dxil/ 

where the last equation follows because f satisfies the linear Bellman equation Eq. |9l 
Erom the definition of Y we obtain dY - Yds + ju^ds + udW. Using again Ito’s 
Lemma Eq. [33] 

dZ = -ZdY + ^Zd[Y, Y] = -Z(Vds + udW) 

Using the product rule Eq.|32|we get 


d{Zif) = ifdZ + Zdif + d[Z, if] — —Zif/udW + ZdxifgdW = Zif{u* — u)dW 

where in the last step we used that u* - ^gdxif which follows from u*{t, x) - -git, x)dxJit, x). 
and ifit, x) - (see section inT i. Integrating diZif) from ttoT using Eq.|34|yields 


ZiT)ifiT)-Zit)ilfit,x) 



diZif) 


dsZifiu* — u)dW 


where we used that Z(f) = I and if(T) - exp(-<l>(X(7’))). This proves Eg. ITT] 


□ 


With the Lemma, it is easy to prove Theorem]!] Taking the expected value in Eq.fTTI 
proves Eq.|4] 


ifit,x) =E[e-^^‘’^’“^] 


This is a closed form expression for the optimal cost-to-go as a path integral. 

To prove Eq.|5] we multiply Eq.[TT]with VT(i) = £ dW, which is an increment of 
the Wiener Process and take the expectation value: 


^ E J Zif(iP-u)dwJ^ " f E[Ztf(u* - u)]dr 


where in the first step we used EVT(s) = 0 and in the last step we used Ito Isometry 
Eq. [^ To get u* we divide by the time increment s - t and take the limit of the time 
increment to zero. This will yield the integrand of the RHS if(t, x)(u*(t, x) - u(t, x). 
Therefore the expected value disappears and we get 

u*it, x) = M(f, x) H-lim- E [e“‘^*'’''’"^W(s)l 

if(t, x) sir s - t ^ J 

which is Eq. |5] 
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2.3 Monte Carlo sampling 

Theorem[T]gives an explicit expression for the optimal control u*{t, x) and the optimal 
cost-to-go J{t, x) in terms of an expectation value over trajectories that start at x at time 
t until the horizon time T. One can estimate the expectation value by Monte Carlo 
sampling. One generates N trajectories X{t)i, i - 1,..., starting at x, t that evolve 
according to the dynamics Eq.[T] Then, x) and u*{t, x) are estimated as 


M 



( 12 ) 



(13) 


with S i(t, X, u) the value of S (f, x, u) from Eq. |2]for the /th trajectory X{s)i, lT(i),, t < 
s <T . The optimal control estimate involves a limit which we must handle numerically 
by setting s - t - e > 0. Although in theory the result holds in the limit e —» 0, in 
practice e should be taken a finite value because of numerical instability, at the expense 
of theoretical correctness. 

The estimate involves a control u, which we refer to as the sampling control. The¬ 
orem [1] shows that one can use any sampling control to compute these expectation 
values. The choice of ii affects the efficiency of the sampling. The efficiency of the 
sampler depends on the variance of the weights w, which can be easily understood. If 
the weight of one sample dominates all other weights, the weighted sum over N terms 
is effectively only one term. The optimal weight distributions for samping is obtained 
when all samples contribute equally, which means that all weights are equal. It can be 
easily seen from Lemma |2 that this is obtained when u - u*. In that case, the right 
hand side of Eq. [TT]is zero and thus is S (t, x, u*) a deterministic quantity. This means 
that for all trajectories Xi{t) the value 5,(f, x, u*) is the same (and equal to the optimal 
cost-to-go J{t, x)). Thus, sampling with u* has zero variance meaning that all samples 
yield the same result and therefore only one sample is required. 

One can view the choice of u as implementing a type of importance sampling and 
the optimal control u* is the optimal importance sampler. One can also deduce from 
Lemma|2]that when u is close to u*, the variance in the right hand side of Eq.[TT]as a 
result of the different trajectories is small and thus is the variance in w, = 
is small. Thus, the closes u is to u* the more effective is the importance sampler 
[ Thijssen and Kappen, 2015) . 

Since it is in general not feasible to compute u* exactly, the key question is how to 
compute a good approximation to u*. In order to address this question, we propose the 
so-called cross-entropy method. 

3 The cross-entropy method 

The cross-entropy method [|De Boer et al., 20051 is an adaptive approach to importance 
sampling. Let 2f be a random variable taking values in the space X. Let fvix) be a 
family of probability density function on X parametrized by v and !i(x) be a positive 
function. Suppose that we are interested in the expectation value 



(14) 
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where E„ denotes expectation with respect to the pdf /„ for a particular value of v' = u. 
A crude estimate of / is by naive Monte Carlo sampling from /„; Draw N samples 
Xi,i - . .,N from /„ and construct the estimator 


/ = 



(15) 


The estimator is a stochastic variable and is unbiased, which means that its expectation 
value is the quantity of interest; E„1 = /. The variance of / quantifies the accuracy of 
the sampler. The accuracy is high when many samples give a significant contribution 
to the sum. However, when the supports of /„ and h have only a small overlap, most 
samples Xj from /„ will have h{Xi) » 0 and only few samples effectively contribute to 
the sum. In this case the estimator has high variance and is inaccurate. 

A better estimate is obtained by importance sampling. The idea is to define an 
importance sampling distribution g{x) and to sample N samples from g(v) and construct 
the estimator: 




(16) 


It is easy to see that this estimator is also unbiased; Eg/ - jjYji - ^uKX) - 

1. The question now is to find a g such that / has low variance. When g - fu Eq. M 
reduces to Eq.fTSl 

Before we address this question, note that it is easy to construct the optimal impor¬ 
tance sampler. It is given by 


^ Kx)fu(x) 

g{x)^—r- 

where the denominator follows from normalization: 1 - f dxg*{x). In this case the 
estimator Eq. [T6]becomes / = / for any set of samples. Thus, the optimal importance 
sampler has zero variance and / can be estimated with one sample only. Clearly g* 
cannot be used in practice since it requires /, which is the quantity that we want to 
compute! 

However, we may find an importance sampler that is close to g*. The cross entropy 
method suggests to find the distribution /,, in the parametrized family of distributions 
that minimises the KL divergence 

KL{g*\f,) = J dxg-{x)\og <x -Eg. log/.,(X) (X -E„/r(X)log/.,(X) = -D{v) (17) 

where in the first step we have dropped the constant term Eg. log g*{X) and in the second 
step have used the definition of g* and dropped the constant factor 1 //. 

The objective is to maximize D{v) with respect to v. Eor this we need to compute 
D{v) which involves an expectation with respect to the distribution /„. We can use 
again importance sampling to compute this expectation value. Instead of /„ we sample 
from f„ for some w. We thus obtain 

D(v) ^EJi(X)^^ \ogMX) 
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We estimate the expectation value by drawing N samples from f„. If D is convex and 
differentiable with respect to v, the optimal v is given by 

i y log/v«) = 0 (18) 

The cross entropy method considers the following iteration scheme. Initialize wo = u. 
In iteration n = 0,1,... generate N samples from and compute v' by solving Eq.fTSl 
Set w„+i = V'. 

We illustrate the cross entropy method for a simple example. Consider A’ = R 
and the family of so-called tilted distributions /v(x) = ■^p{x)e'’^, with p{x) a given 
distribution and W = J dxp(x)e''^ the normalization constant. We assume that it is 
easy to sample from /,, for any value of v. Choose u -0, then the objective Eq.[T4]is to 
compute I - J dxp(x)h(x). We wish to estimate I as efficient as possible by optimizing 
V. Eg.fTSlbecomes 

dlogN, ^ ZZi 

dy h(X,)e-»’^' 

Note that the left hand side is equal to EyA" and the right hand side is the ’h weighted’ 
expected X under p. The cross entropy update is to find v such that h-weighted expected 
X equals E,W. This idea is known as moment matching; one finds v such that the 
moments of the left and right hand side, in this case only the first moment, are equal. 


3.1 The Kullback-Leibler formulation of the path integral control 
problem 


In order to apply the cross entropy method to the path integral control theory, we refor¬ 
mulate the control problem Eq.[T]in terms of a KL divergence. Let X denote the space 
of continuous trajectories on the interval [f, T]: r = A'(i), t < s < T with fixed initial 
value X{t) - X. Denote /7 «(t) the distribution over trajectories t with control u. 

The distributions pu for different ii are related to each other by the Girsanov The¬ 
orem. We derive this relation by simply discretising time as before. In the limit 
ds —> 0, the conditional probability of Xs+ds given Xs is Gaussian with mean ps = 
A’i + f{s,Xs)ds + g(s,Xs)uis,Xs)ds and variance Egds = g{s,Xs)^ds. Therefore, the 
conditional probability of a trajectory t = X,-,t\x with initial state X, = x is|l 


T-ds 


'„(t) = lim Pf N{Xs+ds\Ps, Sj) 


(19) 


= Po(t) exp 


~ Jt ^ u(s,X^)g(s,X,) ^(dXs-f(s,X,)ds) 


^In the multi-dimensional case of Eq.[7]this generalizes as follows. The variance is g(5', X 5 )yg(i', X^Yds ■- 
AEsds with Es = g(s, Xs)R~^g(s, XsY and 

P,i(t) = po(r)exp^-^ i/s^u(s,X,yg(s,XX^s'g(^’XMs.X,) 

= po(T)exp^j ds—u(s, X(s))'Ru(s,Xf) + i((i, X(i))'Rc(lV(.9)jj 
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Po{t) is the distribution over trajectories in the absence of control, which we call the 
uncontrolled dynamics. From this we obtain the Radon-Nikodym derivative 


dpoir) 

dpuir) 




j. •y 

ds—u(s,X(s)) 



u{s,X(s))dW(s) 


( 20 ) 


where in the last step we used dynamics Eq.[T] Using Eg.fTOlone immediately sees that 


f dTpu(T)\og^^^ =E„ f ds^u(s,X(s)f 
J Po(t) J, 2 

In other words, the quadratic control cost in the path integral control problem Eq. [2 
can be expressed as a KL divergence between the distribution over trajectories under 
control u and the distribution over trajectories under the uncontrolled dynamics. Eq.[2 
can thus be written as 

J{t,x) = mm J'c/rpjX'rj^log^lll^ + y(T)j (21) 


pT 

with V(t) - ^{Xt) + dsV(s,X(s)). Since there is a one-to-one correspondence 
between u and p„, one can replace the minimization with respect to the functions u in 
Eq.[2T|by a minimisation with respect to the distribution p subject to a normalization 
constraint J drpir) - 1. The optimal solution is given by 

P*(r) = J po(T)exp(-V(T)) (22) 

where i/^(f, x) - is the normalization, which is identical to Eq.Hl Substituting 

p* in Eq.|2T]yields the familiar result J(t, x) = — logi/^(f, x). 

Ea.l22lexr)resses p* in terms of the uncontrolled dynamics po and the control cost. 
It suggests a Monte Carlo sampling scheme that samples from po and weights with 
Erom Eq.|20l we can equivalently express Eq.|22]using importance sampling with 
importance sampling control u as 

P*(r) = exp(-y(T)) = ^ p,i(r) exp(-5 (f, x,«)) (23) 

!/<'(?, x) dpuir) ip{t, x) 


3.2 The cross entropy method for path integral control 


We are now in a similar situation as the cross entropy method. We cannot compute 
the optimal control u* that parametrizes the optimal distribution p* - p,,. and instead 
wish to compute a near optimal control u such that pu is close to p*. Eollowing the CE 
argument, we minimise 


KL{p'‘\pu) oc -Ep. logpfl 


f T 


oc lim E;, 
1 , 


(24) 

\ 


^ -u^{s,X,)ds - u(s,X,)g(s,X,) ^(X,,+d, -X, - f(s,X,)ds) 


>1/(1, x) 


-S (t,x,u) 


(fi|iM(i,2f(i))^ - M(i,2f(i)) ^M(i,2f(i)) + 
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where in the second line we used Eq. |20] and discard the constant term E^. log po and 
in the third line we used Eq.|22to express the expectation with respect to the optimal 
distribution p* controlled by u* in terms of a weighted expectation with respect to an 
arbitrary distribution p controlled by u. We further used that X^+ds = + f{s, Xs)ds + 

g{s,Xs)iu{s,Xs) + dW^)). The expectation of dWs in Eq. |24]is non-zero due to the 
weighting by □ 

The KL divergence Eq.|24]must be optimized with respect to the functions iit j - 
{u{s,Xs), f < s < T}. In addition, the KL divergence involves an expectation value that 
uses a sampling control Ui-j - {u{s, Xs), t < s < T], We are free to choose any sampling 
control as they all are unbiased estimators, but the more the sampling control resembles 
the optimal control, the more efficient can these expecations values be estimated. 

We now assume that ft is a parametrized function with parameters 6. In the time- 
dependent case, we consider different 6s for each of the functions u{s, separately. 
In this case the gradient of the KL divergence Eq.|20]is given by: 


dKL{p*\p) 

dOs 


<p{t, x) 


I^^^Xis)) - u{s,X{s)) ■ 


dWs \ du(s,X{s)) 


ds 


d0s 


<25) 


In the case that u{s, x) and u(s, x) are linear combinations of a set of K basis func¬ 
tions hsk(x) with parameters 6sk and 0^^, respectively, ie. m(s, x) - 2fLi Sskhsk(x) and 
similar for u{t, x), we can set the gradient equal to zero and obtain the set of equations: 


K 


^ [Ssi - 0 ^;) (hsihsk) 
/=! 



t<s<T, k^l,...,K 


(26) 


where we defined (F) - s(t,x,u)p ^ ^ distribution over trajectories under 

control u that is linearly parametrized by (P. Eq.|26]is for each s a system of K linear 
equations with K unknowns 6sk,k = 1,...,K. The statistics (hsihsk) and (^^^hsk'^ 
can be estimated for all times t < s < T simultaneously from a single Monte Carlo 
sampling run using the control ii parametrized by (P. The fixed point equations Ea.l26l 


were derived in | Thijssen and Kappen, 20151 using a different reasoning. 

Although in principle the optimal control explicitly depends on time, there may be 
reasons to compute a control function u{x) that does not explicitly depend on time. Eor 
instance, consider a stabilizing task such as an inverted pendulum. The optimal control 
solution u*{t, x) assumes an optimal timing of the execution of the swing-up. If for 
some reason this is not the case and the timing is off, an inappropriate control u{t, x) 
is used at time t. Another situation where a time-independent solution is preferred is 
when the horizon time is very large, and the dynamics and the cost are also not explicit 
functions of time. The advantage of a time-independent control solution is clearly that 
it requires less storage. 

We thus consider u(Xs) and u{Xs) independent of time parametrised by 0 and (P, 
respectively. In this case the gradient of the KL divergence Eq.|24]is given by: 


dKL(p*\p) 

do 


-'il 


>p( t, x) ^ 


du(X(s)) 

ds{u{X{s))-u{X{s))) 

od 


(27) 


^For the special case of p = p* we have e jt) and the dWs term vanishes. 
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Note the extra integral over s, due to the fact that a single control function is active at 
all times. In the last term, the integration over s has resulted in a ltd stochastic integral. 
This has removed the awkward numerical estimation of (EdW{s)/ds}. 

In the case that u(x) and u(x) are linear combinations of a set of K basis functions 
hkix) with parameters Ok and 6^’, respectively, we can again set the gradient equal to 
zero and obtain the set of equations: 


Z («<-»?) (i dshi(X(s))hkiX(s))j - dWslik(X{s))j k - (28) 

Eq. |28]is a system of K linear equations with K unknowns 9k, k - . .,K. 

If required, the estimations of 9 in Eqs. |^and|28]can be repeated several times, 
each time with an improved 9, u, implementing an adaptive importance sampling algo¬ 
rithm. In iteration n, 9 - 6>„+i is computed using a sampling control parametrized by 

In the case that u does not depend linearly on 9 one cannot directly solve = 

0. In this case one must resort to a gradient descent procedure. In this case, one can 
also include the idea of adaptive importance sampling. Remember that the KL diver¬ 
gence Eq.|24]must be minimized with respect to 0 but also involves a sampling control, 
parametrized by (fi. Since the gradient descent procedure presumably monotonically 
improves the control, it is best to use the most recent control estimate as sampling con¬ 
trol. Setting M = M in the gradients for the time-dependent and time-independent cases 
Eas.l25]and l27l significantly simplifies them and the gradient descent updates become 


^s,n+\ — 9 

0n+l = dn-9 


d KL{p*\p) , 
d9s,„ 
d KL{p*\p) , 
dOn I- 


0s,n + 9 


On +9 


jdWs du(s,X(s))\ 

50.,„ / 

5M(2f(i))\ 
d9„ I 


\ 

r 


dW, 


(29) 

(30) 


respectively, and p > Qa small parameter. Since, Eqs.|29]and|30]are the gradients of the 
KL divergence, their convergence is guaranteed using standard arguments. We refer to 
this gradient method as the Path Integral Cross Entropy method or PICE. 


4 Numerical illustration 

In this section, we illustrate path integral learning for two simple problems. Eor a linear 
quadratic control problem, where we compare the result with the optimal solution, and 
for an inverted pendulum control task where we compute the non-linear state feedback 
controller. 

Consider the finite horizon 1-dimensional linear quadratic control problem with 
dynamics and cost 

dX{s) - u{s,X{s))ds + dW{s) 0 < s < T 

nT 0 

C - E j ds—u^(s,X(s)) + —X{s)^ 

Jo 2 2 

with EdW{s)^ - vds. The optimal control solution can be shown to be a linear feed¬ 
back controller 

u*{s,x) - -R *P(s)x Pis) - y/QR tanh 
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Figure 1: Illustration of PICE Ea.[30lfor a 1-dimensional linear quadratic control prob¬ 
lem with Q - 2,R - l,v = 0.1,7’ = 5. We used time discretization ds - 0.01 and 
generated 50 sample trajectories for each gradient computation all starting from x = 2 
and j] = 0.1. The top left plot shows 0i,2 as a function of gradient desent step. Top 
right shows effective sample size as a function of gradient descent step. Bottom left 
shows optimal cost to go 7 as a function of gradient descent step. Bottom right shows 
50 sample trajectories in the last gradeint descent iteration. 


For finite horizon, the optimal control explicitly depends on time, but for large T the 

optimal control becomes independent of f: u*(x) - - yj^x. We estimate a time- 
independent feed-back controller of the form u(x) - 6 i + 62 X using path integral learning 
rule Eq.|30] The result is shown in fig.[T] 

Note, that 61,62 rapidly approach their optimal values 0, -1.41 (red and blue line). 
Under- estimation of \ 6 i \ is due to the finite horizon and the transient behavior induced 
by the initial value of Xq, as can be checked by initializing Xq from the stationary 
optimally controlled distribution around zero (results not shown). The top right plot 
shows the entropic sample size defined as the scaled entropy of the distribution: ss - 

from Eq.[T2] as a function of gradient desent step, 
which increases due to the improved sampling control. 

As a second illustration we consider a simple inverted pendulum, that satisfies the 
dynamics 

a - — cos O' -I- M 

where a is the angle that the pendulum makes with the horizontal, a - 37r/2 is the 
initial ’down’ position and a - nl 2 is the target ’up’ position, -cosa is the force 
acting on the pendulum due to gravity. Introducing x\ - a,X 2 - d and adding noise. 
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Figure 2; Illustration of gradient descent learning Eq. [30] for a second order inverted 
pendulum problem with Qi - 2,Q2 - 0.02,R - l,v - 0.3, T - 5. We used time 
discretization = 0.1 and generated 500 sample trajectories for each gradient compu¬ 
tation all starting from (xi, X 2 ) = (-nl2, 0) + (0,0.02) and 77 = 0.4, K\ - 20, K 2 = 40. 
Left: Entropic sample size versus importance sampling iteration. Middle: Optimal cost 
to go versus importance sampling iteration. Right: Optimal control solution m(xi , X 2 ) 
versus xi, X 2 with 0 < xi <2n and -2 < X 2 < 2 . 


we write this system as 

dXi{s) = fi{X{s))ds + gi(u{s,X{s) + dW{s)) 0<s<T, i = l,2 
/l(x) = X2 

fiix) = -cosxi 

^ = ( 0 , 1 ) 

nT R 0 0 

C = e| 7/i-M(i,X(i))^-H^(sinXi(i)-l)^H-^X2(i)^ 

Jo 2 2 2 

with EfifWj = vds and v the noise variance. 

We estimate a time-independent feed-back controller on a grid ki = 1 : A"!, ^2 = 1 : 

K2, 


m(xi,X2) = X,- + (ki - l)dxi < Xi < X,- + kidxi, i — 1,2 

with xf the maximum and minimum value of x,- and dx, - (x^ - xJ)IKi. The results 
of the path integral learning rule Eq. [30] are shown in fig. [2] Eig. |2l.^eft shows that 
the effective sample size increases with importance sampling iteration and stabalizes 
to approximately 80 %. Eig.|2]VIiddle shows the optimal cost-to-go decreases with im¬ 
portance sampling iteration. The fluctuation are due to the finite constant learning rate 
r\ and the sampling approximation of the expectation value in the gradient computa¬ 
tion. Eig. |2jlight shows the solution after 1000 importance sampling iterations in the 
(xi,X 2 ) plane. White star is initial location ( 37 r/ 2 ,0) (pendulum pointing down, zero 
velocity) and red star is the target state x = ( 7 r/ 2 , 0 ) (pendulum point up, zero velocity). 
There are two example trajectories shown. The red trajectory forces the particle with 
positive velocity towards the top, and the blue solution forces the particle with negative 
velocity towards the top. Note the green NE-SW ridge in the control solution around 
the top. These are states where the position deviates from the top position, but with 
a velocity directed towards the top. So in these states no control is required. In the 
orthogonal NW-SE direction, control is needed to balance the particle. This example 
shows that the learned state feedback controller is able to swing-up and stabilize the 
inverted pendulum. 


16 
















It should be noted that the use of the path integral method for stabilizing stochastic 
control task is challenging, as is evident from the large fluctuations despite the large 
number of samples for these relatively small problems. The reasons are the following. 

• The weights of the trajectories are proportional to with 5 oc l/d from Eq. [T] 
and A - Rv playing the role of temperature. Small A has the effect that the 
effective sample size is small (close to one sample), because the weight of one 
trajectory dominates all other trajectories. Thus, in order to have a large effective 
number of samples one cannot choose v too small, meaning that the stochastic 
disturbances will be relatively large which make the problem harder to control. 
In order to control these, the control should be sufficiently large, meaning that R 
should be small. But R cannot be chosen too small either since it affects the effec¬ 
tive sample size in the same way as v. This problem is due to the log transform 
that is used to linearize the Bellman equation. 

• No matter how complex or unstable the problem, if the control solution ap¬ 
proaches the optimal control sufficiently close, the effective sample size should 
reach 100 %. Representing the optimal control solution exactly requires in gen¬ 
eral an infinitely large model, except in special cases where a finite dimensional 
representation of the optimal control is known. An infinite model requires in¬ 
finitely many samples to avoid overfitting. Less than maximal entropic sample 
size is thus also due to the finite dimensionality of the model. 

This suggests that the key issue for the succesful application of the path integral method 
is the parametrization that is used to represent u. This representation should balance the 
two conflicting requirements of any learning problem: 1) the parametrization should be 
sufficiently flexible to represent an arbitrary function and 2) the number of parameters 
should be not too large so that the function can be learned with not too many samples. 

The inverted pendulum can of course also be controlled using other methods, for 
instance using the iterative LQG. One first solves the deterministic control problem 
in the absence of noise and then computes a linear feedback controller around this 
solution. In that case the solution is ’unimodal’, representing one of the two possible 
swing-up solutions, and time-dependent. The point of the simulation is to illustrate 
that it is in principle possible to learn any state feedback controller, such as the ’multi¬ 
modal’ control solution that represents both solutions simultaneously. 


5 Bayesian system identification: potential for neuro¬ 
science data analysis 

We have shown that the path integral control problem is equivalent to a statistical es¬ 
timation problem. We can use this identity to solve large stochastic optimal control 
problems by Monte Carlo sampling. We can accelerate this computation by impor¬ 
tance sampling and have shown that the optimal control coincides with the optimal 
importance sampler. In this section, we consider the reverse connection between con¬ 
trol and sampling: We consider the problem to compute the posterior distribution of a 
latent state model that we wish to approximate using Monte Carlo sampling, and to use 
optimal controls to accelerate this sampling problem. 

In neuroscience, there is great interest for scalable inference methods, e.g. to esti¬ 
mate network connectivity from data or decoding of neural recordings. It is common to 
assume that there is an underlying physical process of hidden states that evolves over 
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time, which is observed through noisy measurements. In order to extract information 
about the processes giving rise to these observation, or to estimate model parameters, 
one needs knowledge of the posterior distributions over these processes given the ob¬ 
servations. For instance, in the case of calcium imaging, one can indirectly observe 
the network activity of a large population of neurons. Here, the hidden states represent 
the activity of individual neurons, and the observations are the calcium measurements, 
[Mishchenko et al., 2011]. 

The state-of-the-art is to use one of many variations of particle filtering-smoothing 
methods to estimate the state distributions conditioned on the observations, see [Briers 
et al., 2010, Doucet and Johansen, 2011, Lindsten and Schoen, 2013]. A fundamental 
shortcoming of these methods is that the estimated smoothing distribution relies heav¬ 
ily on the filtering distribution which is computed using particle filtering. For high 
dimensional problems these distributions may differ significantly which yields poor 
estimation accuracy, as seen in the following example. 

One can easily see that the path integral control computation is mathematically 
equivalent to a Bayesian inference problem in a time series model with pq{t) the distri¬ 
bution over trajectories under the forward model Eq. 1 with u - 0, and where one 
interprets the likelihood of the trajectory r = Xr,T under 

some fictitious observation model pCy^lXj) = with given observations y,-T. The 

posterior is then given by p*{t) in Eq. 21. One can generalize this by replacing the 
fixed initial state x by a prior distribution over the initial state. Therefore, the optimal 
control and importance sampling results of section 3.2 can be directly applied. The 
advantage of the PI method is that the computation scales linear in the number of par- 
ticlefl compared to the state-of-the-art particle smoother that scales quadratic in the 
number of particles, although in practice significant accelerations can be made, e.g. 
QEearnhead et al., 2010|[Lmdsten and Schon, 2013| . 

To illustrate this we estimate the posterior distribution of a noisy 2-dimensional fir¬ 
ing rate model given 12 noisy observations of a single neuron, say vi (green diamonds 
in fig. [3]). The model is given by 

dv, 

—— - -Vf + tanh(y *Vt + 6) + crjy„dWi 
dt 

7 is a 2-dimensional antisymmetric matrix and 6* is a 2-dimensional vector, both with 
random entries from a Gaussian distribution with mean zero and standard deviation 
25 and standard deviation 0.75, respectively, and = 0.2. We assume a Gaussian 
observation model N(yi\v\t., cr^(,s) = 0.2. We generate the 12 1-dimensional 

observations y,-, / = 1,..., 12 with vi,, the firing rate of neuron 1 at time f, during one 
particular run of the model. 

We parametrized the control as u{x, t) - A{t)x + b{t) and estimated the 2x2 matrix 
A{t) and the 2-dimensional vector b{f) as described in [Thijssen and Kappen, 2015] and 
Eq.|26] 

The path integral control solution (RPIIS) is shown in fig. O and was computed 
using 22 importance sampling iterations with 6000 particles per iteration. As a com¬ 
parison, the forward-backward particle filter solution (EEBSi) was computed using N 
= 6000 forward and M = 3600 backward particles. In blue, we see the EEBSi es¬ 
timates and in red the RPIIS estimates of the posterior distribution p(vo:r|yo: 7 ')- The 
computation time was 35.1 s and 638 s respectively. 

^The details of this approach are left for a following paper. 


18 








Figure 3: Comparison of path integral control (RPIIS) and the forward filter backward 
smoother (FFBSi cf. [Lindsten and Schoen, 2013]) for a 2-dimensional neural network, 
showing mean and one standard deviation of the marginal posterior solution for both 
methods. 


Figure |4] shows the estimated control parameters used for the RPIIS method. The 
open loop controller bi{t) steers the particles to the observations. The feedback con¬ 
troller An (f) ’stabilizes’ the particles around the observations (blue lines). Due to the 
coupling between the neurons, the non-observed neuron is also controlled in a non¬ 
trivial way. To appreciate the effect of using a feedback controller, we compared these 
results with an open-loop controller m(x, f) = b{t). This reduces the ESS from 60 % for 
the feedback controller to around 29 % for the open loop controller. The lower sam¬ 
pling efficiency increases the error of the estimations, especially the variance of the 
posterior marginal (not shown). When choosing the importance sampling controller, 
there is in general a trade off between accuracy and the computational effort involved 
in the update rules in Eas.l26lorl28] 

The example shows the potential of adaptive importance sampling for posterior 
estimation in continuous state-space models. A publication with the analysis of this 
approach for high dimensional problems is in preparation. This can be used to accel¬ 
erate maximum likelihood based methods to estimate, for instance connectivity, de¬ 
coding of neural populations, estimation of spike rate functions and, in general, any 
inference problem in the context of state-space models; see [Oweiss, 2010] and refer¬ 
ences therein] for a treatment of state-space models in the contex of neuroscience and 
neuro-engineering. 


6 Summary and discussion 

The original path integral control result of Theorem [1] expresses the optimal control 
u*{t, x) for a specific f, x as a Eeynman-Kac path integral. The important advantage 
of the path integral control setting is that, asymptotically, the result of the sampling 
procedure does not depend on the choice of sampling control. The reason is that the 
control used during exploration is an importance sampling in the sense of Monte Carlo 
sampling and any importance sampling strategy gives the same result asymptotically. 
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Figure 4: Control parameters; Left: Open-loop controller Z7,(f), i -1,2; Right: Diago¬ 
nal entries of feedback linear controller A„(f), t - 1,2 


Clearly, the efficiency of the sampling depends critically on the sampling control. The¬ 
orem [1] can be used very effectively for high dimensional stochastic control problems 
using the Model Predictive Control setting [Gomez et al., 20151 . 

However, Theorem [1] is of limited use when we wish to compute a parametrized 
control function for all f, x. We have therefore here proposed the cross entropy argu¬ 
ment, originally formulated to optimize importance sampling distributions, to find a 
control function whose distribution over trajectories is closest to the optimally con¬ 
trolled distribution. In essence, this optimization replaces the original KL divergence 
KL{p\p*) Eq. |2T]by the reverse KL divergence KL{p*\p) and optimizes for p. The 
resulting path integral learning method provides a flexible framework for learning a 
large class of non-linear stochastic optimal control problems with a control that is an 
arbitrary function of state and parameters. The idea to optimize this reverse KL diver¬ 
gence was earlier explored for the time-dependent case and linear feedback control in 
[Gomez et al., 2014) . 

We have restricted our numerical examples to parametrizations that are linear in the 
parameters. Generalization to non-linear parametrizations, such as for instance (deep) 
neural networks, Gaussian processes or other machine learning methods can be readily 
considered, at no significant extra computational cost. 

[De Boer et al., 2005) also discuss the application of the CE method to a Markov 
decision problem (MDP), which is a discrete state-action control problem. The main 
differences with the current paper are that we discuss the continuous state-action case. 
Secondly, the MDP problem is formulated as an optimization problem to find x* - 
argmax^/(x). |D| Boer et al., 2005[ provide a generic approach to apply the CE method 
to optimization, by defining a distribution p(x) and optimise the expected cost C = 
Tix with respect to p. By construction, the optimal p is of the form n{x) - 

6x,x’, is- ^ distribution that has all its probability mass on the optimal state [f|. The 
CE optimization computes this optimal zero entropy/zero temperature solution start¬ 
ing from an initial random (high entropy/high temperatue) solution. As a result of this 
implicit annealing, it has been reported that the CE method applied to optimization 

^Generalizations restrict p to a parametrized family p(x\ff) and optimize with respect to 6 instead of p 
direction IMannor et al., 2003) . 
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suffers from severe local minima problems [Szita and Lorincz, 20061 . An important 
difference for the path integral control problems that we discussed in the present paper 
is the presence of the entropy term p{x) log p{x) in the cost objective. As a result, the 
optimal p is a finite temperature solution that is not peaked at a single state but has 
finite entropy. Therefore, problems with local minima are expected to be less severe. 

The path integral learning rule Eq.[30]has some similarity with the so-called policy 
gradient method for average reward reinforcement learning [Sutton et ah, 1999] 

= riE„ a) 

a 

where s, a are discrete states and actions, n{a\s, ff) is the policy which is the probability 
to choose action a in state s, and 6 parametrizes the policy. E;^ denotes expectation 
with respect to the invariant distribution over states when using policy n and Q’^ is the 
state-action value function (cost-to-go) using policy n. The convergence of the policy 
gradient rule is proven when the policy is an arbitrary function of the parameters. 

The similarities between policy gradient and path integral learning are that the pol¬ 
icy takes the role of the sampling control and the policy gradient involves an expec¬ 
tation with respect to the invariant distribution under the current policy, similar to the 
time integral in Eq. |^for large T when the system is ergodic. The differences are 
1) that the expectation value in the policy gradient is weighted by Q’’, which must be 
estimated independently, whereas the brackets in Eg. [^involve a weighting with e 
which is readily available; 2) Eq.[30]involves an ltd stochastic integral whereas the pol¬ 
icy gradient does not; 3) the policy gradient method is for discrete state and actions and 
the path integral learning is for controlled non-linear diffusion processes; 4) the policy 
gradient expectation value is not independent of n as is the case for the path integral 
gradients Eas.l25]andl27] 
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A ltd calculus 

Given two diffusion processes. 


dY ^ A(Y)ds + B{Y)dW (31) 

dZ = C{Z)ds + D(Z)dW 

the Ito’s product rule gives the evolution of the product process 

d{YZ) = YdZ + ZdY + d[Y, Z] 

d[Y,Z]^ B{Y)D{Z)ds (32) 

The term in the last line is known as the quadratic covariance. 

Let F{Y) as a function of the stochastic process Y. Ito’s Lemma is a type of chain 
rule that gives the evolution of E; 

dF = dYdyF + ]^d[Y, Y]dlF = UdyF + ds + BdyFdW (33) 
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Putting a process Eq. [3T]in integral notation and taking the expected value yields 
the following 


Y = 
E[Y] = 




BdW 


(34) 

(35) 


The ltd Isometry states that 


J A(Y)dW J B(Y)dW = J E[A(Y)B(Y)]ds 


(36) 
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